
How to guide – Language model and keyword searches

Author

Zach Dickson

Code
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

In this notebook, I’ll demonstrate how to use the multilingual language model I created to get the sentiment of newspaper headlines. I’ll also show how to create time-series data according to topics using keywords.

Feel free to copy this notebook and use it for yourself. If you do not already have python installed on your computer and you don’t want to install it (how to check), then you can use this notebook in Google Colab.

This option can be especially appealing if you want to use the language model on lots of text, because Google Colab allows you to use a GPU, which speeds up the model significantly. I’ll demonstrate how to do this at the end as well.


I created a new dataset that combines all the newspapers collected for all five countries. I’ve also cleaned this data so that it is ready to use exactly like I do in this notebook. Feel free to use it as you wish. It’s available for download on Dropbox.


Code
##### If you're using Google Colab, you'll need to change the file location of the dataset to point to wherever you've put the file.
#### There are a few ways to do this -- one is to mount your own Google Drive; another is to upload the newspaper file in the Colab notebook's left-hand column.





import pandas as pd # import necessary library
import numpy as np # import necessary library
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
pio.renderers.default='colab'


pd.set_option('display.max_columns', 50)  ## set max columns to display as 50

df = pd.read_csv('./newspaper_data_all.csv')  ## read in the dataset. See the first comment in this box of code
df.date = pd.to_datetime(df.date)  # set date column to pandas version of date

df.newspaper = df.newspaper.str.replace('_', ' ').str.title() ## these are just changes to the names of the newspapers for presentation purposes
df.country = df.country.str.title().str.replace('Uk','UK') ## these are just changes to the names of the newspapers for presentation purposes

Note: if you have trouble importing the newspapers dataset, see the following tutorial.

First five rows of the dataset:

Code
df.head()
date link title newspaper country
0 2020-01-01 https://sport.fakt.pl/inne-sporty/david-stern-... Zmarł były komisarz ligi NBA David Stern Fakt Poland
1 2020-01-01 https://www.fakt.pl/polityka/sylwester-polityk... "Grzeczny" sylwester polityków. Tylko Jaki się... Fakt Poland
2 2020-01-01 https://www.fakt.pl/wydarzenia/polska/zaglada-... Zagłada ptaków w Polsce. W środę zabijały je w... Fakt Poland
3 2020-01-01 https://sport.fakt.pl/pilka-nozna/nietypowa-ci... Nietypowa cieszynka piłkarzy Cracovii. Przebra... Fakt Poland
4 2020-01-01 https://www.fakt.pl/plotki/z-kim-ania-z-rolnik... Z kim Ania z „Rolnik szuka żony” spędziła sylw... Fakt Poland

I’ll visualise some basic descriptive statistics for the dataset

Articles per newspaper:

Code
df.groupby('newspaper').size()
newspaper
Abc Spain               50666
Bild                   204491
De Telegraaf           176781
De Welt                 42199
El Mundo                39738
El Pais                 67419
Fakt                    89627
Gazeta Wyborcza         27390
Guardian               149828
Nrc                     49361
Rzeczpospolita          20647
Suddeutsche Zeitung    586495
Uk Sun                 128994
Uk Times                32062
Volkskrant              62579
dtype: int64
Code
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme('notebook')
plt.style.use('fivethirtyeight')

df.groupby('newspaper').size().plot(kind='bar')
plt.title('Articles per Newspaper');

Articles per newspaper, per country

Code
df.groupby(['country','newspaper']).size()
country      newspaper          
Germany      Bild                   204491
             De Welt                 42199
             Suddeutsche Zeitung    586495
Netherlands  De Telegraaf           176781
             Nrc                     49361
             Volkskrant              62579
Poland       Fakt                    89627
             Gazeta Wyborcza         27390
             Rzeczpospolita          20647
Spain        Abc Spain               50666
             El Mundo                39738
             El Pais                 67419
UK           Guardian               149828
             Uk Sun                 128994
             Uk Times                32062
dtype: int64
Code
x = pd.DataFrame(df.groupby(['country','newspaper']).size()).reset_index().rename(columns={0:'count'})

plt.figure(figsize=(10,5))

sns.barplot(x='newspaper', y='count', hue='country', data=x)
plt.xticks(rotation=90);


1: How to get the sentiment of a newspaper headline

I’ve trained a multilingual language model to predict the sentiment of news headlines, and I’ve made the model publicly available at HuggingFace. HuggingFace is great because we can import the language model from any notebook, and there is no limit on the number of people who can use it at the same time. If you want to read more about the details of the model, see the following link.

## example code
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline,TFAutoModelForSequenceClassification


tokenizer = AutoTokenizer.from_pretrained("z-dickson/multilingual_sentiment_newspaper_headlines")
model = TFAutoModelForSequenceClassification.from_pretrained("z-dickson/multilingual_sentiment_newspaper_headlines")
sentiment_classifier = TextClassificationPipeline(tokenizer=tokenizer, model=model, device=0) ## device=0 runs the model on the first GPU; in Colab, change the runtime type to GPU first, or drop device=0 to run on the CPU

sentiment_classifier('text we want to get the sentiment for')  ## classifies the text we want to get the sentiment for

In the next cell, I’ll run the same code to import the model. You’ll likely need to install the necessary libraries on your machine or in Google Colab. This can be done by running the following code in a single cell:

pip install transformers
Code
pip install transformers
Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
  ...
Successfully installed huggingface-hub-0.17.3 safetensors-0.4.0 tokenizers-0.14.1 transformers-4.34.0
Code


import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline,TFAutoModelForSequenceClassification


tokenizer = AutoTokenizer.from_pretrained("z-dickson/multilingual_sentiment_newspaper_headlines")
model = TFAutoModelForSequenceClassification.from_pretrained("z-dickson/multilingual_sentiment_newspaper_headlines")
sentiment_classifier = TextClassificationPipeline(tokenizer=tokenizer, model=model, device=0) ## device=0 runs the model on the first GPU; in Colab, change the runtime type to GPU first, or drop device=0 to run on the CPU

Some layers from the model checkpoint at z-dickson/multilingual_sentiment_newspaper_headlines were not used when initializing TFBertForSequenceClassification: ['dropout_75']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at z-dickson/multilingual_sentiment_newspaper_headlines.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.

Now that we have imported the model into our notebook and have created the classifier, we can classify different headlines.

I’ll classify the following headline in each of the five languages used. The headline is just a random headline from the dataset:

In German, the headline is as follows:

  • Düsseldorf: Mann von jungem Messerstecher nach Streit schwer verletzt
  • English: Dusseldorf: Man seriously injured by young knife man after argument

We’ll pass the headline into the classifier in German using:

#example
sentiment_classifier('Mann von jungem Messerstecher nach Streit schwer verletzt')

The result is as follows:

Code
sentiment_classifier('Mann von jungem Messerstecher nach Streit schwer verletzt')
[{'label': 'negative', 'score': 0.9974070191383362}]

The estimated sentiment is negative, and the model is very confident, with a score of about 0.997. What we’re interested in is the sentiment label, negative.
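If you want the label or score programmatically rather than reading them off the printed output, you can index into the returned list of dictionaries. A minimal sketch using the result from the cell above:

```python
# The pipeline returns a list with one dict per input text --
# this is the result shown in the cell above.
result = [{'label': 'negative', 'score': 0.9974070191383362}]

label = result[0]['label']            # the sentiment label
score = round(result[0]['score'], 3)  # the model's confidence
print(label, score)  # negative 0.997
```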


Let’s try the same thing in English, using the following:

#example
sentiment_classifier('Man seriously injured by young knife man after argument')

The only thing we change is the text, which I simply translated to English using Google Translate in a browser.

The result is nearly identical

Code
sentiment_classifier('Man seriously injured by young knife man after argument')
[{'label': 'negative', 'score': 0.998826801776886}]

Next, I’ll work through some examples that you might want to use in an analysis


Worked Example #1

In this example, I’ll show how to use the model to get the sentiment of multiple newspaper headlines

If you want to know how to get a new dataset of headlines according to a specific search criteria, then see the next worked example.


For the example, we’ll draw ten rows with different headlines at random from the dataset.

Here’s the resulting sample:

Code


headline_example = df.sample(10)

headline_example
date link title newspaper country
1649067 2021-11-03 https://www.bild.de/sport/mehr-sport/baseball/... Baseball: Atlanta Braves gewinnen World Series... Bild Germany
1125040 2021-01-21 https://www.sueddeutsche.de/sport/basketball-t... Theis-Gala reicht Celtics nicht zum Sieg über ... Suddeutsche Zeitung Germany
1345511 2021-12-23 https://www.sueddeutsche.de/politik/italien-pa... Fürsorgliches Italien Suddeutsche Zeitung Germany
784752 2021-05-06 /economia/2021-05-06/peaje-en-las-autovias-fon... Peajes en las autovías, fondo para los ERTE y ... El Pais Spain
646337 2022-06-25 https://www.telegraaf.nl/nieuws/1431146744/lim... Limburgse pastoor ’die seksfilmpje toonde’ nie... De Telegraaf Netherlands
1030380 2020-08-23 https://www.sueddeutsche.de/muenchen/starnberg... Bermuda-Dreieck in der Mittelkonsole Suddeutsche Zeitung Germany
556442 2021-02-12 https://www.telegraaf.nl/sport/1236508470/dubb... Dubbel schaatsgoud voor Oranje: ’We moesten wa... De Telegraaf Netherlands
1155656 2021-03-01 https://www.sueddeutsche.de/muenchen/erding/er... Online-Diskussion zur Seenotrettung Suddeutsche Zeitung Germany
290975 2022-06-27 https://www.theguardian.com/world/2022/jun/27/... Nato to put 300,000 troops on high alert in re... Guardian UK
1259656 2021-08-21 https://www.sueddeutsche.de/panorama/kriminali... Vandalen verwüsten Kinder- und Jugendzentrum i... Suddeutsche Zeitung Germany

I’ll write a function below that takes in a dataset like the one above and returns the same dataset with added columns for the sentiment of each article title. To get the result, you just call the function (sentiment_analysis) with the name of the dataset (in the form above) in parentheses.

For example:

#example
sentiment_analysis(headline_example)

returns the following:

Code
def sentiment_analysis(dataset, date_start=None, date_end=None):
  x = dataset.copy()
  try:
    if date_start is not None:  # optionally restrict to a date range
      x.date = pd.to_datetime(x.date)
      x = x.loc[(x.date >= f'{date_start}') & (x.date <= f'{date_end}')]
    # classify all titles in one batch; store the full result and the label
    x['sentiment'] = sentiment_classifier(x.title.to_list())
    x['sentiment_label'] = x.sentiment.apply(lambda s: s['label'])
    return x
  except TypeError:
    print("ERROR: ensure date_start and date_end are both given in the format Year-Month-Day. For example: 2020-01-08")


Code
sentiment_analysis(headline_example)
date link title newspaper country sentiment sentiment_label
1649067 2021-11-03 https://www.bild.de/sport/mehr-sport/baseball/... Baseball: Atlanta Braves gewinnen World Series... Bild Germany {'label': 'positive', 'score': 0.6868038177490... positive
1125040 2021-01-21 https://www.sueddeutsche.de/sport/basketball-t... Theis-Gala reicht Celtics nicht zum Sieg über ... Suddeutsche Zeitung Germany {'label': 'neutral', 'score': 0.7599111199378967} neutral
1345511 2021-12-23 https://www.sueddeutsche.de/politik/italien-pa... Fürsorgliches Italien Suddeutsche Zeitung Germany {'label': 'neutral', 'score': 0.8194842338562012} neutral
784752 2021-05-06 /economia/2021-05-06/peaje-en-las-autovias-fon... Peajes en las autovías, fondo para los ERTE y ... El Pais Spain {'label': 'neutral', 'score': 0.6340370178222656} neutral
646337 2022-06-25 https://www.telegraaf.nl/nieuws/1431146744/lim... Limburgse pastoor ’die seksfilmpje toonde’ nie... De Telegraaf Netherlands {'label': 'neutral', 'score': 0.8677178025245667} neutral
1030380 2020-08-23 https://www.sueddeutsche.de/muenchen/starnberg... Bermuda-Dreieck in der Mittelkonsole Suddeutsche Zeitung Germany {'label': 'neutral', 'score': 0.9753379821777344} neutral
556442 2021-02-12 https://www.telegraaf.nl/sport/1236508470/dubb... Dubbel schaatsgoud voor Oranje: ’We moesten wa... De Telegraaf Netherlands {'label': 'neutral', 'score': 0.976171612739563} neutral
1155656 2021-03-01 https://www.sueddeutsche.de/muenchen/erding/er... Online-Diskussion zur Seenotrettung Suddeutsche Zeitung Germany {'label': 'positive', 'score': 0.568717896938324} positive
290975 2022-06-27 https://www.theguardian.com/world/2022/jun/27/... Nato to put 300,000 troops on high alert in re... Guardian UK {'label': 'negative', 'score': 0.9975405931472... negative
1259656 2021-08-21 https://www.sueddeutsche.de/panorama/kriminali... Vandalen verwüsten Kinder- und Jugendzentrum i... Suddeutsche Zeitung Germany {'label': 'negative', 'score': 0.9977375268936... negative

Now we have two new columns: sentiment (a dictionary with the label and score) and sentiment_label (just the label).
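For plotting or averaging, you'll often also want the score as its own numeric column. Here's a minimal sketch of how to pull it out of the dictionary, on a toy frame with the same structure (the two rows are made up):

```python
import pandas as pd

# Toy frame mimicking the output of sentiment_analysis (made-up rows)
x = pd.DataFrame({
    'title': ['Headline A', 'Headline B'],
    'sentiment': [{'label': 'negative', 'score': 0.98},
                  {'label': 'neutral', 'score': 0.75}],
})

# Extract the score from the dictionary, just like sentiment_label
x['sentiment_score'] = x.sentiment.apply(lambda s: s['score'])
print(x[['title', 'sentiment_score']])
```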


Worked Example #2

How do we generate the list of titles we’re interested in?

Say we want to get all the headlines related to “borders” – below, I’ll create a function that helps you find them.

The function will take in up to four parameters. The first two parameters are as follows:

  • dataset – the dataset that you are passing in to the function. This will likely just be the entire newspaper dataset
  • keyword – this is the keyword you want to use to identify the titles with. In our example, the keyword is borders

I’ll also include two optional parameters (‘optional’ here means that we can just ignore them and pass in values for only the two parameters above if we want)

  • country – this parameter narrows down the results to just a specific country
  • newspaper – this parameter narrows down the results to a specific newspaper

Note: the values that you pass into the function must match the options in the dataset verbatim. The country and newspaper options for the dataset we’ve been working with throughout this notebook are presented below.

Countries:

Code
df.country.unique()
array(['Poland', 'UK', 'Netherlands', 'Spain', 'Germany'], dtype=object)

Newspapers:

Code
df.newspaper.unique()
array(['Fakt', 'Rzeczpospolita', 'Gazeta Wyborcza', 'Uk Times',
       'Guardian', 'Uk Sun', 'Nrc', 'De Telegraaf', 'Volkskrant',
       'El Mundo', 'El Pais', 'Abc Spain', 'Suddeutsche Zeitung',
       'De Welt', 'Bild'], dtype=object)
Code
### I wrote this function so you don't have to! A function takes in different parameters and outputs the data we want. Don't worry about this code, just run the cell and skip to the next line


def get_titles_by_keyword(dataset, keywords, country=None, newspaper=None, date_start=None, date_end=None):
  ndf = pd.DataFrame()
  x = dataset.copy()
  if newspaper is not None:
      x = x[x.newspaper == newspaper]
  if country is not None:
      x = x[x.country == country]
  if date_start is not None:
      x.date = pd.to_datetime(x.date)
      x = x.loc[(x.date >= f'{date_start}') & (x.date <= f'{date_end}')]
  if type(keywords) == str:  # accept a single keyword or a list of keywords
      keywords = [keywords]
  for keyword in keywords:
      xx = x[x.title.str.contains(keyword, case=False, na=False)]
      ndf = pd.concat([ndf, xx])
  ndf = ndf.drop_duplicates()  # a title matching several keywords should only appear once
  if len(ndf) > 0:
      return ndf
  else:
      print('Your query did not return anything. Make sure that the country, newspaper, and dates are correct. Did you give a start date that is later than the end date?')
      print(' --- ')
      print('The newspaper options are: ' + str(df.newspaper.unique()))

The function is named get_titles_by_keyword.

We can call the function with the keyword border using the following:

#example
get_titles_by_keyword(df, 'border')

The result will appear as follows – here I’ve also restricted the search to the Uk Times between 2020-03-09 and 2022-04-05:

Code
get_titles_by_keyword(df, 'border', newspaper = 'Uk Times', date_start = '2020-03-09', date_end='2022-04-05')
date link title newspaper country
110970 2020-03-16 https://www.thetimes.co.uk/article/greece-accu... Greece accuses Turkey of sending fake migrants... Uk Times UK
111030 2020-03-23 https://www.thetimes.co.uk/article/canadas-ind... Coronavirus: Native Americans close their bord... Uk Times UK
111066 2020-03-26 https://www.thetimes.co.uk/article/quest-ends-... Quest ends at border after 2 years and 14,000 ... Uk Times UK
111458 2020-05-10 https://www.thetimes.co.uk/article/chinese-and... Chinese and Indian troops injured in border br... Uk Times UK
111776 2020-06-16 https://www.thetimes.co.uk/article/indian-troo... 20 Indian troops die in border brawl with Chin... Uk Times UK
... ... ... ... ... ...
140289 2021-11-12 https://www.thetimes.co.uk/article/french-allo... Polish border crisis will fuel Channel migrant... Uk Times UK
140316 2021-11-15 https://www.thetimes.co.uk/article/britain-and... France and Germany join Britain’s pledge to st... Uk Times UK
140878 2022-01-18 https://www.thetimes.co.uk/article/satellite-i... Tonga volcano relief efforts hampered as borde... Uk Times UK
141119 2022-02-15 https://www.thetimes.co.uk/article/joe-biden-d... Joe Biden doubts Russia is withdrawing from Uk... Uk Times UK
141349 2022-03-13 https://www.thetimes.co.uk/article/russian-air... Russian airstrikes hit Ukraine military base n... Uk Times UK

105 rows × 5 columns

If we want only the “borders” headlines from the UK, we can use the following:

#example
get_titles_by_keyword(df, 'border', country = 'UK')

Which results in the following:

Code
get_titles_by_keyword(df, 'border', country = 'UK')
date link title newspaper country
110554 2020-01-30 https://www.thetimes.co.uk/article/longest-smu... Longest smuggling tunnel found under US-Mexico... Uk Times UK
110645 2020-02-09 https://www.thetimes.co.uk/article/syria-refug... Syria refugees: ‘Borders won’t stop us if Assa... Uk Times UK
110653 2020-02-10 https://www.thetimes.co.uk/article/trump-cuts-... Trump cuts aid to fund military and border wal... Uk Times UK
110702 2020-02-15 https://www.thetimes.co.uk/article/crash-landi... Crash Landing On You: Cross-border romantic co... Uk Times UK
110787 2020-02-25 https://www.thetimes.co.uk/article/angela-merk... Contender to succeed Angela Merkel vows to tak... Uk Times UK
... ... ... ... ... ...
419625 2016-08-09 https://www.thesun.co.uk/news/1581942/ukraine-... WAR TENSIONS Ukraine warns Russian invasion po... Uk Sun UK
419963 2016-07-27 https://www.thesun.co.uk/news/1509975/north-ko... SNAKE-SPONSORED TERRORISM North Korea accuses ... Uk Sun UK
420187 2016-07-17 https://www.thesun.co.uk/news/1457645/gunman-s... armenia opposition siege Gunmen storm police H... Uk Sun UK
420506 2016-06-30 https://www.thesun.co.uk/news/1370262/vladimir... WORLD WAR THREE FEARS Vladimir Putin threatens... Uk Sun UK
420679 2016-06-22 https://www.thesun.co.uk/news/1321708/eu-has-s... MIGRANT SHAMBLES EU 'has surrendered complete ... Uk Sun UK

1199 rows × 5 columns

If we want only the headlines containing “borders” for UK Times, we can use the following:

#example
get_titles_by_keyword(df, 'border', newspaper = 'Uk Times')

The keyword can be in any language. Here, for example, is the German keyword “impfung” (vaccination) restricted to Bild – note that because the search matches substrings, it also picks up unrelated words such as “Verunglimpfung” (defamation):

Code
get_titles_by_keyword(df, 'impfung', newspaper = 'Bild')
date link title newspaper country
1497255 2020-01-05 https://www.bild.de/regional/leipzig/leipzig-n... Ab März Pflicht: 20 000 Kinder noch ohne Maser... Bild Germany
1506852 2020-02-13 https://www.bild.de/regional/bremen/bremen-akt... Windpocken-Alarm! Schul-Verbot für Kinder ohn... Bild Germany
1511670 2020-03-05 https://www.bild.de/bild-plus/ratgeber/2020/ra... Impfungen, die Sie nicht kennen – aber brauchen! Bild Germany
1515570 2020-03-23 https://www.bild.de/ratgeber/2020/ratgeber/kan... Pneumokokken-Impfung: Was bringt sie bei Corona? Bild Germany
1519010 2020-04-07 https://www.bild.de/unterhaltung/leute/leute/w... Wegen Verunglimpfung: Droht dem „Überläufer“ e... Bild Germany
... ... ... ... ... ...
1695518 2022-06-05 https://www.bild.de/ratgeber/gesundheit/gesund... Schmerzhafte Krankheit: Wer braucht eine Impfu... Bild Germany
1696249 2022-06-09 https://www.bild.de/politik/2022/politik/affen... Affenpocken: Stiko empfiehlt Impfung für Risik... Bild Germany
1698781 2022-06-21 https://www.bild.de/bild-plus/ratgeber/2022/ra... Omikron-Subvariante BA.5: Brauche ich jetzt do... Bild Germany
1699254 2022-06-23 https://www.bild.de/regional/stuttgart/stuttga... 800 000 Spritzen pro Woche - Land bereitet Meg... Bild Germany
1700691 2022-06-30 https://www.bild.de/bild-plus/ratgeber/2022/ra... Experten in BILD: Wie es mit den Corona-Impfun... Bild Germany

872 rows × 5 columns

Code
sentiment_classifier('Corona-Ausbruch: Sieben Heimbewohner trotz Impfung infiziert')
[{'label': 'negative', 'score': 0.9988634586334229}]

Once we have the keyword-based subset we’re interested in, we can pass it to the sentiment function above to get the sentiment for that specific topic.
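Here’s the shape of that two-step pipeline on a toy dataset, with a stub standing in for sentiment_classifier so the sketch runs without downloading the model (the titles are made up):

```python
import pandas as pd

# Toy data; in practice this would be the full newspaper dataset
df_toy = pd.DataFrame({'title': ['Border crisis deepens',
                                 'Football results round-up',
                                 'New border checks announced']})

def stub_classifier(texts):
    # stands in for the HuggingFace pipeline; returns one dict per text
    return [{'label': 'negative', 'score': 0.9} for _ in texts]

# Step 1: filter by keyword; Step 2: classify the filtered titles
subset = df_toy[df_toy.title.str.contains('border', case=False)].copy()
subset['sentiment'] = stub_classifier(subset.title.to_list())
subset['sentiment_label'] = subset.sentiment.apply(lambda s: s['label'])
print(subset[['title', 'sentiment_label']])
```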

But what if we have lots of keywords that we want to use to identify the newspaper titles?

Say I want to use multiple keywords related to “borders”. A list of related keywords might look like the following:

  • border
  • bounds
  • confine
  • boundary
  • perimeter

We can use the same function by passing in a list of keywords. A list in python takes the following form:

['keyword1', 'keyword2', 'keyword3', 'etc']

You can pass in as many keywords as you want, but they need to be separated by commas and enclosed in square brackets.

NOTE: here’s a short tutorial on lists in case you get stuck.


If we wanted to use the five keywords above, the list would look like the following:

['border','bounds','confine','boundary','perimeter']

Putting it all together, we can use the following:

#example
get_titles_by_keyword(dataset = df, keywords = ['border','bounds','confine','boundary','perimeter'])

Which returns the following:

Code
get_titles_by_keyword(dataset = df, keywords = ['border','bounds','confine','boundary','perimeter'])
date link title newspaper country
110554 2020-01-30 https://www.thetimes.co.uk/article/longest-smu... Longest smuggling tunnel found under US-Mexico... Uk Times UK
110645 2020-02-09 https://www.thetimes.co.uk/article/syria-refug... Syria refugees: ‘Borders won’t stop us if Assa... Uk Times UK
110653 2020-02-10 https://www.thetimes.co.uk/article/trump-cuts-... Trump cuts aid to fund military and border wal... Uk Times UK
110702 2020-02-15 https://www.thetimes.co.uk/article/crash-landi... Crash Landing On You: Cross-border romantic co... Uk Times UK
110787 2020-02-25 https://www.thetimes.co.uk/article/angela-merk... Contender to succeed Angela Merkel vows to tak... Uk Times UK
... ... ... ... ... ...
326780 2016-09-12 https://www.thesun.co.uk/news/1767339/jeremy-c... corbyn & gone Jez and George Osborne join Labo... Uk Sun UK
327193 2016-08-08 https://www.thesun.co.uk/news/1574143/tories-c... MAY'S MASSIVE MAJORITY Tories could win '90-se... Uk Sun UK
344090 2021-08-29 https://www.thesun.co.uk/news/16001089/homeown... ExclusiveGRAVE INSULT Homeowner 'desecrated wa... Uk Sun UK
411786 2017-03-06 https://www.thesun.co.uk/news/3019037/arrested... CROSSING THE BOUNDARY Perv arrested after 'sho... Uk Sun UK
323734 2017-03-29 https://www.thesun.co.uk/news/3205230/two-majo... SECURITY REVIEW Major reviews of Parliament se... Uk Sun UK

1331 rows × 5 columns

Worked Example #3: Attention to different topics

The number of newspaper titles varies across countries and newspapers, which makes direct comparisons difficult. One way around this is to measure “attention” to a given topic as a proportion of all headlines in a given period. For example, if the Sun publishes 100 headlines in a day and 5 of them are about immigration, then the Sun’s attention to immigration that day is 5 divided by 100, or .05 (5%). This lets us compare the relative emphasis a newspaper devotes to one or more topics. This measurement of attention follows Baumgartner & Jones. I’ve also written about the same method using political texts and computational methods here.


\[Attention_{i,t} = \frac{\sum headlines_{i,t}}{\sum headlines_{t}}\]

Where attention to issue \(i\) at time \(t\) equals the sum of headlines about issue \(i\) divided by the sum of all headlines.
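Plugging numbers into the formula is simple arithmetic; the Sun example above as a sketch:

```python
# The Sun example above: 5 immigration headlines out of 100 total in a day.
issue_headlines = 5
total_headlines = 100
attention = issue_headlines / total_headlines
print(attention)  # 0.05, i.e. 5%
```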


The value of measuring attention in this way is that we can then derive time-series measures that are comparable across newspapers. I’ll write a function below that takes in the following parameters:

  • dataset – can be the entire newspaper dataset
  • keywords – can be a single keyword or a list of keywords like shown above
  • newspaper – name of newspaper of interest
  • frequency – this takes the values of W (weekly), M (monthly), and Y (yearly) and is the interval at which attention is calculated
  • topic – the name of the topic that pertains to the keywords (this can be anything, but the function will return a column with this name for the attention values)

#newspapers:
['Fakt', 'Rzeczpospolita', 'Gazeta Wyborcza', 'Uk Times',
       'Guardian', 'Uk Sun', 'Nrc', 'De Telegraaf', 'Volkskrant',
       'El Mundo', 'El Pais', 'Abc Spain', 'Suddeutsche Zeitung',
       'De Welt', 'Bild']
Code
### I wrote this function so you don't have to! A function takes in different parameters and outputs the data we want. Don't worry about this code, just run the cell and skip to the next line



## get all titles whose headline contains any of the keywords, with optional filters

def get_titles_by_keyword(dataset, keywords, country=None, newspaper=None, date_start=None, date_end=None):
  ndf = pd.DataFrame()
  x = dataset.copy()
  if newspaper is not None:
      x = x[x.newspaper == newspaper]
  if country is not None:
      x = x[x.country == country]
  if date_start is not None or date_end is not None:
      x.date = pd.to_datetime(x.date)
      if date_start is not None:
          x = x.loc[x.date >= date_start]
      if date_end is not None:
          x = x.loc[x.date <= date_end]
  if isinstance(keywords, str):  ## allow a single keyword passed as a plain string
      keywords = [keywords]
  for keyword in keywords:
      ## note: a title that matches several keywords will appear once per match
      xx = x[x.title.str.contains(keyword, case=False)]
      ndf = pd.concat([ndf, xx])
  if len(ndf) > 0:
      return ndf
  else:
      print('Your query did not return anything. Make sure that the country, newspaper, and dates are correct. Did you give a start date that is later than the end date?')
      print(' --- ')
      print('The optional newspapers are: ' + str(dataset.newspaper.unique()))


def get_proportion_by_keyword(dataset, keywords, newspaper, frequency, topic, date_start=None, date_end=None):
  x = dataset.copy()
  x.date = pd.to_datetime(x.date)
  ## restrict to the pandemic window used throughout this notebook
  x = x.query('date >= "2020-01-01" and date <= "2022-05-31"')
  if newspaper is not None:
    x = x.loc[x.newspaper == newspaper]
  if date_start is not None:
    x = x.loc[(x.date >= date_start) & (x.date <= date_end)]
  try:
    ## total headlines per interval
    x = pd.DataFrame(x.groupby(pd.Grouper(key='date', freq=frequency)).size()).reset_index().rename(columns={0:'total_headlines'})
    ## headlines matching the keywords per interval
    titles_by_keyword = get_titles_by_keyword(dataset, keywords, newspaper=newspaper)
    titles_by_keyword.date = pd.to_datetime(titles_by_keyword.date)
    titles_by_keyword = pd.DataFrame(titles_by_keyword.groupby(pd.Grouper(key='date', freq=frequency)).size()).reset_index().rename(columns={0:'issue_headlines'})
    x = x.merge(titles_by_keyword, on='date', how='left')
    x.issue_headlines = x.issue_headlines.fillna(0)
    x['attention'] = x.issue_headlines/x.total_headlines
    x['topic'] = topic
    x['newspaper'] = newspaper
    return x
  except AttributeError:
    ## get_titles_by_keyword found nothing and returned None, so there is nothing to compute
    return None


Code
get_proportion_by_keyword(df, ['immigrant','migrant'], newspaper='Uk Times', topic = 'immigration', frequency = 'M', date_start='2020-01-01', date_end = '2021-01-01')
date total_headlines issue_headlines attention topic newspaper
0 2020-01-31 1110 3 0.002703 immigration Uk Times
1 2020-02-29 1062 10 0.009416 immigration Uk Times
2 2020-03-31 1144 13 0.011364 immigration Uk Times
3 2020-04-30 1066 5 0.004690 immigration Uk Times
4 2020-05-31 1077 9 0.008357 immigration Uk Times
5 2020-06-30 1069 1 0.000935 immigration Uk Times
6 2020-07-31 1082 3 0.002773 immigration Uk Times
7 2020-08-31 1138 13 0.011424 immigration Uk Times
8 2020-09-30 1091 9 0.008249 immigration Uk Times
9 2020-10-31 1079 9 0.008341 immigration Uk Times
10 2020-11-30 1047 4 0.003820 immigration Uk Times
11 2020-12-31 1074 2 0.001862 immigration Uk Times
12 2021-01-31 34 5 0.147059 immigration Uk Times

The function is called get_proportion_by_keyword. Here’s an example using the keyword “border”, the UK Sun, and a monthly frequency. We’ll name the topic “Borders”.

Here’s the code to do the above:

# example
get_proportion_by_keyword(df, 'border', 'Uk Sun', 'M', 'Borders')

Which returns the following:

Code
get_proportion_by_keyword(df, 'border', 'Uk Sun', 'M', 'Borders')
date total_headlines issue_headlines attention topic newspaper
0 2020-01-31 1095 2 0.001826 Borders Uk Sun
1 2020-02-29 872 3 0.003440 Borders Uk Sun
2 2020-03-31 1101 11 0.009991 Borders Uk Sun
3 2020-04-30 1240 1 0.000806 Borders Uk Sun
4 2020-05-31 1155 4 0.003463 Borders Uk Sun
5 2020-06-30 1094 14 0.012797 Borders Uk Sun
6 2020-07-31 1203 6 0.004988 Borders Uk Sun
7 2020-08-31 2260 6 0.002655 Borders Uk Sun
8 2020-09-30 2419 8 0.003307 Borders Uk Sun
9 2020-10-31 2295 6 0.002614 Borders Uk Sun
10 2020-11-30 2377 8 0.003366 Borders Uk Sun
11 2020-12-31 2301 15 0.006519 Borders Uk Sun
12 2021-01-31 2384 21 0.008809 Borders Uk Sun
13 2021-02-28 2372 10 0.004216 Borders Uk Sun
14 2021-03-31 2741 9 0.003283 Borders Uk Sun
15 2021-04-30 2596 14 0.005393 Borders Uk Sun
16 2021-05-31 2328 10 0.004296 Borders Uk Sun
17 2021-06-30 2318 0 0.000000 Borders Uk Sun
18 2021-07-31 2264 5 0.002208 Borders Uk Sun
19 2021-08-31 2026 4 0.001974 Borders Uk Sun
20 2021-09-30 2035 6 0.002948 Borders Uk Sun
21 2021-10-31 1972 2 0.001014 Borders Uk Sun
22 2021-11-30 1852 7 0.003780 Borders Uk Sun
23 2021-12-31 1694 6 0.003542 Borders Uk Sun
24 2022-01-31 1815 7 0.003857 Borders Uk Sun
25 2022-02-28 1659 16 0.009644 Borders Uk Sun
26 2022-03-31 1780 6 0.003371 Borders Uk Sun
27 2022-04-30 1425 5 0.003509 Borders Uk Sun
28 2022-05-31 1506 6 0.003984 Borders Uk Sun
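The monthly bucketing inside get_proportion_by_keyword relies on pandas’ Grouper. Here’s a toy sketch of what that step does (the data is made up; freq='W' and 'Y' work the same way):

```python
import pandas as pd

# Toy data: two January headlines and one February headline.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-05', '2020-01-20', '2020-02-03']),
    'title': ['a', 'b', 'c'],
})
# Count headlines per calendar month, as the function does:
counts = toy.groupby(pd.Grouper(key='date', freq='M')).size()
print(counts.tolist())  # [2, 1]
```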

Now we can take the new dataset and plot monthly attention to borders by the Sun:

Code
x = get_proportion_by_keyword(df, 'border', 'Uk Sun', 'M', 'Borders')
pio.renderers.default='colab'
#sns.relplot(x='date', y='attention', hue='topic', data=x, kind='line', aspect=2) ## same plot in seaborn

fig = px.line(x, x="date", y="attention", color='topic',markers=True, title='Attention to borders by UK Sun', height=500)
fig.update_layout(template='plotly_white')
fig.update_layout(
    font_family="Courier New",
    font_color="black",
    title_font_family="Courier New",
    title_font_color="black",
    font_size = 18,
    legend_title_font_color="black",
    template='plotly_white',
    showlegend=True,
    xaxis_title = 'Date',
    width=1200)
fig.show(include_plotlyjs=True)

We can similarly get the weekly values as well:

Code
x = get_proportion_by_keyword(df, 'border', 'Uk Sun', 'W', 'Borders')

#sns.relplot(x='date', y='attention', hue='topic', data=x, kind='line', aspect=2)

fig = px.line(x, x="date", y="attention", color='topic',markers=True, title='Attention to borders by UK Sun', height=500)
fig.update_layout(template='plotly_white')
fig.update_layout(
    font_family="Courier New",
    font_color="black",
    title_font_family="Courier New",
    title_font_color="black",
    font_size = 18,
    legend_title_font_color="black",
    template='plotly_white',
    showlegend=True,
    xaxis_title = 'Date',
    width=1200)
fig.show(include_plotlyjs=True)

Worked Example #4: Comparing attention to different issues across newspapers

In this example, I’ll show how to use the functions from this notebook to compare the attention of different newspapers to different issues. This one is a bit more complex, so feel free to reach out if you have trouble.


Say I want to get the weekly attention of all the German newspapers to vaccines during the pandemic. I also want to compare how the sentiment differed between the newspapers.

First I need to identify the keywords related to vaccines. I’ll use a few for each language and store them as lists:

## example:
german_keywords = ['impfung', 'impfimpfung', 'impfen']
Code

german_keywords = ['impfung', 'impfimpfung', 'impfen']

Then I can create new datasets for all the newspapers in Germany. I’ll do this by using the get_titles_by_keyword function, passing in the country so that it returns all the German newspapers:

#example:
get_titles_by_keyword(dataset = df, keywords = german_keywords, country = 'Germany')

which provides the following:

Code
get_titles_by_keyword(dataset = df, keywords = german_keywords, country = 'Germany')
date link title newspaper country
867760 2022-12-29 https://www.sueddeutsche.de/bayern/bayern-coro... Bilanz : Bayerns Impfzentren schließen zum Jah... Suddeutsche Zeitung Germany
867804 2022-12-08 https://www.sueddeutsche.de/gesundheit/china-c... Sars-CoV-2 : Heftige Infektionswelle in China ... Suddeutsche Zeitung Germany
867873 2022-11-22 https://www.sueddeutsche.de/politik/impfpflich... Corona-Pandemie : Pfleger und Ärzte brauchen k... Suddeutsche Zeitung Germany
867881 2022-11-17 https://www.sueddeutsche.de/gesundheit/kinder-... Stiko-Empfehlung : Gesunde Kleinkinder brauche... Suddeutsche Zeitung Germany
867885 2022-11-16 https://www.sueddeutsche.de/gesundheit/guertel... Nebenwirkungen der Covid-Impfung? : Verdachtsf... Suddeutsche Zeitung Germany
... ... ... ... ... ...
1684082 2022-04-13 https://www.bild.de/bild-plus/ratgeber/psychol... Psychische Form von Gewalt: Schweigen ist grau... Bild Germany
1693094 2022-05-24 https://www.bild.de/ratgeber/2022/ratgeber/cov... Covid19-Empfehlung der Stiko - Gesunde Kinder ... Bild Germany
1696690 2022-06-11 https://www.bild.de/bild-plus/ratgeber/kind-fa... Kinder: Was Sie nach dem Schimpfen immer tun s... Bild Germany
1698376 2022-06-19 https://www.bild.de/ratgeber/2022/ratgeber/usa... USA impfen Kleinste gegen Corona - Corona-Piks... Bild Germany
1699624 2022-06-25 https://www.bild.de/regional/leipzig/leipzig-n... Impfen: Neuer Stichtags-Ärger bei Kliniken und... Bild Germany

5218 rows × 5 columns

I’ll store that dataset of German headlines about vaccines in a new variable called germany_vaccines:

#example
germany_vaccines = get_titles_by_keyword(dataset = df, keywords = german_keywords, country = 'Germany')

We’ll then use the new dataset to get the sentiment of each article with the sentiment_analysis function from above, storing the result in another variable called sentiment_germany_vaccines:

#example
sentiment_germany_vaccines = sentiment_analysis(germany_vaccines)

This will take a few minutes, because we are using the language model to classify each of the 5,000+ headlines about vaccines in Germany.
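The sentiment_analysis function is defined earlier in the notebook. A hypothetical minimal re-creation, consistent with the sentiment and sentiment_label columns shown below, might look like this (a sketch: the real version closes over the classifier rather than taking it as an argument, and its exact implementation may differ):

```python
import pandas as pd

def sentiment_analysis_sketch(data, classifier):
    """Hypothetical re-creation: classify every title and store both the raw
    {'label', 'score'} dict and the bare label as separate columns."""
    out = data.copy()
    out['sentiment'] = out.title.apply(lambda t: classifier(t)[0])
    out['sentiment_label'] = out.sentiment.apply(lambda d: d['label'])
    return out

# Usage with a dummy classifier standing in for the HuggingFace pipeline:
dummy = lambda text: [{'label': 'neutral', 'score': 0.5}]
toy = pd.DataFrame({'title': ['Impfen zeigt deutliche Wirkung']})
result = sentiment_analysis_sketch(toy, dummy)
print(result.sentiment_label.tolist())  # ['neutral']
```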

Here’s what we’ll get if we view that new dataset:

Code
germany_vaccines = get_titles_by_keyword(dataset = df, keywords = german_keywords, country = 'Germany')

germany_vaccines = germany_vaccines.sample(1000) ## we'll only use a sample of 1000 so it doesn't take a long time to classify all the headlines

sentiment_germany_vaccines = sentiment_analysis(germany_vaccines)
Code
sentiment_germany_vaccines
date link title newspaper country sentiment sentiment_label
1246095 2021-07-30 https://www.sueddeutsche.de/muenchen/fuerstenf... Impfen zeigt deutliche Wirkung Suddeutsche Zeitung Germany {'label': 'neutral', 'score': 0.5086002349853516} neutral
1310215 2021-10-26 https://www.sueddeutsche.de/gesundheit/gesundh... USA-Reisen ab November möglich nach Impfung Suddeutsche Zeitung Germany {'label': 'neutral', 'score': 0.5342729687690735} neutral
1332218 2021-12-02 https://www.sueddeutsche.de/muenchen/fuerstenf... Die Impfung wirkt Suddeutsche Zeitung Germany {'label': 'positive', 'score': 0.6801331639289... positive
1315560 2021-11-01 https://www.sueddeutsche.de/gesundheit/gesundh... Auffrischungsimpfungen in den meisten Heimen z... Suddeutsche Zeitung Germany {'label': 'neutral', 'score': 0.7856541275978088} neutral
1335496 2021-11-29 https://www.sueddeutsche.de/gesundheit/gesundh... Polizei startet mit Corona-Auffrischungsimpfungen Suddeutsche Zeitung Germany {'label': 'negative', 'score': 0.8802963495254... negative
... ... ... ... ... ... ... ...
1577247 2020-12-18 https://www.bild.de/bild-plus/ratgeber/2020/ra... Fragen zum Impfplan - Bin ich dick genug für e... Bild Germany {'label': 'neutral', 'score': 0.9369633793830872} neutral
1583904 2021-01-15 https://www.bild.de/regional/muenchen/muenchen... Gesundheitsminister Holetschek - „Das Impfen i... Bild Germany {'label': 'positive', 'score': 0.5605814456939... positive
1148468 2021-02-25 https://www.sueddeutsche.de/gesundheit/gesundh... Sieben-Tage-Inzidenz sinkt nur langsam: 168.00... Suddeutsche Zeitung Germany {'label': 'negative', 'score': 0.993061363697052} negative
1251331 2021-08-06 https://www.sueddeutsche.de/gesundheit/kranken... Corona-Auffrischimpfung: Krankenhäuser bereite... Suddeutsche Zeitung Germany {'label': 'positive', 'score': 0.7920066118240... positive
1586770 2021-01-27 https://www.bild.de/news/inland/news-inland/di... Diese Rentner wollen endlich ihre Corona-Impfu... Bild Germany {'label': 'positive', 'score': 0.7067196369171... positive

1000 rows × 7 columns

To see how many positive, negative, and neutral vaccine headlines each newspaper has:

Code
sentiment_germany_vaccines.groupby(['newspaper', 'sentiment_label']).size().reset_index()
newspaper sentiment_label 0
0 Bild negative 95
1 Bild neutral 85
2 Bild positive 59
3 De Welt negative 32
4 De Welt neutral 33
5 De Welt positive 8
6 Suddeutsche Zeitung negative 189
7 Suddeutsche Zeitung neutral 281
8 Suddeutsche Zeitung positive 218
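Since the newspapers contribute different numbers of headlines to the sample, the raw counts above aren’t directly comparable; it can help to normalise them within each newspaper. A sketch using the same groupby (toy counts here, since the real numbers depend on the random sample drawn above):

```python
import pandas as pd

# Toy data standing in for the sampled German vaccine headlines.
toy = pd.DataFrame({
    'newspaper': ['Bild', 'Bild', 'Bild', 'De Welt'],
    'sentiment_label': ['negative', 'negative', 'positive', 'neutral'],
})
counts = toy.groupby(['newspaper', 'sentiment_label']).size()
# Divide each count by its newspaper's total to get within-newspaper shares:
shares = counts / counts.groupby(level='newspaper').transform('sum')
print(round(shares.loc[('Bild', 'negative')], 3))  # 0.667
```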
Code
sentiment_germany_vaccines.groupby(['newspaper', 'sentiment_label']).size().unstack().plot(kind='bar', stacked=True, figsize=(8,4))
<Axes: xlabel='newspaper'>

What if we want to get the attention of the German newspapers to vaccines throughout the pandemic?

For that, we just need to pass the german_keywords to the get_proportion_by_keyword function. Because the function requires the newspaper name, we’ll create three separate datasets, one for each German newspaper:

Remember, the German newspapers are: Suddeutsche Zeitung, De Welt, and Bild

attention_SZ = get_proportion_by_keyword(df, keywords = german_keywords, newspaper = 'Suddeutsche Zeitung', frequency = 'M', topic = 'Vaccines')
attention_DW = get_proportion_by_keyword(df, keywords = german_keywords, newspaper = 'De Welt', frequency = 'M', topic = 'Vaccines')
attention_Bild = get_proportion_by_keyword(df, keywords = german_keywords, newspaper = 'Bild', frequency = 'M', topic = 'Vaccines')

Now we can plot the attention to vaccines by any of the newspapers:

Code
attention_SZ = get_proportion_by_keyword(df, keywords = german_keywords, newspaper = 'Suddeutsche Zeitung', frequency = 'M', topic = 'Vaccines')
attention_DW = get_proportion_by_keyword(df, keywords = german_keywords, newspaper = 'De Welt', frequency = 'M', topic = 'Vaccines')
attention_Bild = get_proportion_by_keyword(df, keywords = german_keywords, newspaper = 'Bild', frequency = 'M', topic = 'Vaccines')
Code
#sns.relplot(x='date', y='attention', hue='newspaper', data=attention_SZ, kind='line', aspect=2)

fig = px.line(attention_SZ, x="date", y="attention", color='newspaper',markers=True, title='Attention', height=500)
fig.update_layout(template='plotly_white')
fig.update_layout(
    font_family="Courier New",
    font_color="black",
    title_font_family="Courier New",
    title_font_color="black",
    font_size = 18,
    legend_title_font_color="black",
    template='plotly_white',
    showlegend=True,
    xaxis_title = 'Date',
    width=1200)
fig.show(include_plotlyjs=True)

What if we want to plot them all in the same graph?

We can combine the datasets and then re-create the same plot:

# combine dataset:
german_newspapers_combined = pd.concat([attention_SZ, attention_DW, attention_Bild])
# reset index
german_newspapers_combined.reset_index(inplace=True)
# recreate plot:
sns.relplot(x='date', y='attention', hue='newspaper', data = german_newspapers_combined, kind='line', aspect=2)
# add title (optional)
plt.title('Attention to Vaccines in German Newspapers')

which gives the following result:

Code
german_newspapers_combined = pd.concat([attention_SZ, attention_DW, attention_Bild])
## reset index
german_newspapers_combined.reset_index(inplace=True)
#recreate plot:
fig = px.line(german_newspapers_combined, x="date", y="attention", color='newspaper',markers=True, title='Attention to Vaccines in German Newspapers', height=500)

fig.update_layout(
    font_family="Courier New",
    font_color="black",
    title_font_family="Courier New",
    title_font_color="black",
    font_size = 18,
    legend_title_font_color="black",
    template='plotly_white',)
fig.show(include_plotlyjs=True)

Worked Example #5: Borders analysis for WP1

In this final tutorial, I’ll get dynamic attention to ‘borders’ using keywords, as well as dynamic sentiment. I got a little lazy with adding comments (sorry).

Code
german = ['grenz','schengen','Reisebeschränkung', 'Reiseverbot', 'Einreiseverbot', 'Mobilitätsbeschränkung',  'Schlagbaum']
polish = ['granica', 'Schengen', 'ograniczenie', 'zakaz',  'podróży', 'Zakazy', 'mobilności', 'szlaban']
spanish = ['front','schengen','restriccion','prohibición', 'prohibiciones','movilidad']
dutch = ['grens','schengen','reisbeperking', 'reisverbod', 'inreisverbod', 'mobiliteitsbeperking',  'slagboom']
english = ['border','schengen','restriction','prohibition', 'prohibitions','mobility','travel ban','entry ban']

dic = {'Poland': 'polish', 'Germany': 'german', 'Spain': 'spanish', 'Netherlands': 'dutch', 'UK': 'english'}
df['language'] = df['country'].map(dic)
Code
languages = ['german', 'polish', 'spanish', 'dutch', 'english']


attn= pd.DataFrame()

for language in languages:
    for newspaper in df.loc[df.language == language].newspaper.unique():
        try:
            attention = get_proportion_by_keyword(df, keywords = eval(language), newspaper = newspaper, frequency = 'M', topic = 'Borders')
            attention['country'] = language
            attn = pd.concat([attn, attention])
        except TypeError:
            pass


attn=attn.reset_index(drop=True)
lang_to_country = {'german': 'Germany', 'polish': 'Poland', 'spanish': 'Spain', 'dutch': 'Netherlands', 'english': 'UK'}
attn['country'] = attn['country'].map(lang_to_country)
Your query did not return anything. Make sure that the country, newspaper, and dates are correct. Did you give a start date that is later than the end date?
 --- 
The optional newspapers are: ['Fakt' 'Rzeczpospolita' 'Gazeta Wyborcza' 'Uk Times' 'Guardian' 'Uk Sun'
 'Nrc' 'De Telegraaf' 'Volkskrant' 'El Mundo' 'El Pais' 'Abc Spain'
 'Suddeutsche Zeitung' 'De Welt' 'Bild']
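A side note on the eval(language) trick in the loop above: it works because the keyword-list variables share the language names, but a plain dictionary lookup is safer and easier to debug. A sketch (with abbreviated keyword lists; the full lists are defined above):

```python
# Abbreviated keyword lists keyed by language name.
keyword_lists = {
    'german': ['grenz', 'schengen'],
    'english': ['border', 'schengen'],
}
language = 'english'
keywords = keyword_lists[language]  # replaces eval(language)
print(keywords)  # ['border', 'schengen']
```

A misspelled language name then raises a clear KeyError instead of eval silently evaluating the wrong object.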
Code
x = attn.groupby(['country','date']).mean(numeric_only=True).reset_index()  ## numeric_only avoids averaging non-numeric columns

fig = px.line(x, x="date", y="attention", color='country',markers=True, title='Attention to Borders in European Newspapers', height=500)

fig.update_layout(
    font_family="Courier New",
    font_color="black",
    title_font_family="Courier New",
    title_font_color="black",
    font_size = 18,
    legend_title_font_color="black",
    template='seaborn',
)


fig.show(include_plotlyjs=True)

#sns.relplot(x='date', y='attention', hue='country', data=attn, kind='line', aspect=2)
#plt.title('Attention to Borders in European Newspapers')
Code
### Let's get the sentiment of the headlines that contain the keywords over time


languages = ['german', 'polish', 'spanish', 'dutch', 'english']


attn1= pd.DataFrame()

for newspaper in df.newspaper.unique():
    try:
        lang = df.loc[df.newspaper == newspaper].language.unique()[0]
        attention = get_titles_by_keyword(df, keywords = eval(lang), newspaper = newspaper)
        attention['newspaper'] = newspaper
        attn1 = pd.concat([attn1, attention])
    except TypeError:
        pass
Your query did not return anything. Make sure that the country, newspaper, and dates are correct. Did you give a start date that is later than the end date?
 --- 
The optional newspapers are: ['Fakt' 'Rzeczpospolita' 'Gazeta Wyborcza' 'Uk Times' 'Guardian' 'Uk Sun'
 'Nrc' 'De Telegraaf' 'Volkskrant' 'El Mundo' 'El Pais' 'Abc Spain'
 'Suddeutsche Zeitung' 'De Welt' 'Bild']
Code
attn1 = attn1.reset_index(drop=True)

attn1 = attn1.sample(2000, replace=True) ### we'll only take a sample of 2000 of these headlines so it doesn't take too long

attn1 = sentiment_analysis(attn1)
Code
attn1['sentiment_centered'] = pd.Categorical(attn1.sentiment_label, categories=['negative', 'neutral', 'positive'], ordered=True).codes - 1
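The line above maps the ordered labels onto -1/0/+1: Categorical codes run 0, 1, 2 over ['negative', 'neutral', 'positive'], and subtracting 1 centres them at zero so averages can be read as net sentiment. A toy check:

```python
import pandas as pd

labels = pd.Series(['negative', 'neutral', 'positive', 'neutral'])
codes = pd.Categorical(
    labels, categories=['negative', 'neutral', 'positive'], ordered=True
).codes - 1
print(codes.tolist())  # [-1, 0, 1, 0]
```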
Code
attn1.groupby(['newspaper', 'sentiment_label']).size().unstack().plot(kind='bar', stacked=True, figsize=(8,4))
plt.title('Sentiment of Headlines about Borders in European Newspapers')
Text(0.5, 1.0, 'Sentiment of Headlines about Borders in European Newspapers')

Code
attn2 = attn1.groupby(['newspaper','country', pd.Grouper(key='date', freq='1M')]).mean(numeric_only=True).reset_index()
attn2 = attn2.query('date > "2020-01-01" and date < "2022-05-31"')
Code
x = attn2.groupby(['country', 'date']).mean(numeric_only=True).reset_index()

x['Sentiment'] = x.sentiment_centered
x['Country'] = x.country

fig = px.line(x, x='date',y='Sentiment',color='Country',markers=True, title='Dynamic Sentiment of Headlines about Borders in European Newspapers', height=500,
              color_discrete_sequence=px.colors.qualitative.T10)

fig.update_layout(
    font_family="Courier New",
    font_color="black",
    title_font_family="Courier New",
    title_font_color="black",
    font_size = 18,
    legend_title_font_color="black",
    template='plotly_white',
    showlegend=True,
    xaxis_title = 'Date',
    width=1200)


fig.update_traces(
    marker=dict(
        size=10),
    line=dict(
        width=3
    )
)


fig.show(include_plotlyjs=True)
#sns.relplot(x='date', y='sentiment_centered', hue='country', data=attn2, kind='line', aspect=2)
#plt.title('Dynamic Sentiment of Headlines about Borders in European Newspapers')

How do I use the GPU in Google Colab if I need to get the sentiment for a lot of newspaper headlines?

Classifying the sentiment of the headlines can take a long time. For example, the ~5k German vaccine headlines took about 5 minutes. This poses a big challenge if we want to classify many more headlines.

One alternative is to use a GPU for the matrix multiplications that happen billions of times when running transformer models. CPUs are comparatively slow at this task, so a GPU can be MUCH faster. Luckily, Google Colab allows free use of GPUs. Colab is probably the best way to run this entire notebook anyway, since installing Python locally seems to be challenging for a lot of people, so even if you don’t want to use a GPU, Google Colab might be the best option.

Google has a tutorial on how to use a GPU in Colab here. Once you enable the GPU, the only change you’ll need to make to the language model is to add device=0 when defining the classifier. This would look like the following:


sentiment_classifier = TextClassificationPipeline(tokenizer=tokenizer, model=m1, device=0)  ## add in device=0 to use the GPU

instead of this:

sentiment_classifier = TextClassificationPipeline(tokenizer=tokenizer, model=m1)  ## uses the slower CPU
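Independent of the GPU, HuggingFace pipelines also accept a list of texts (and a batch_size argument), which avoids classifying one headline at a time. A sketch of the call shape, shown with a stand-in function since the real classifier requires downloading the model:

```python
# Stand-in for sentiment_classifier; the real pipeline accepts the same call shape.
def fake_classifier(texts, batch_size=32):
    return [{'label': 'neutral', 'score': 0.5} for _ in texts]

titles = ['headline one', 'headline two', 'headline three']
results = fake_classifier(titles, batch_size=32)  # one dict per headline
labels = [r['label'] for r in results]
print(labels)  # ['neutral', 'neutral', 'neutral']
```

With the real sentiment_classifier, you would pass `germany_vaccines.title.tolist()` instead of the toy titles.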